Results 1 - 13 of 13
1.
JMIR Form Res; 8: e53241, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38648097

ABSTRACT

BACKGROUND: Electronic health records are a valuable source of patient information that must be properly deidentified before being shared with researchers, a process that requires expertise and time. By contrast, synthetic data can considerably reduce the restrictions on the use and sharing of real data, allowing researchers to access data more rapidly and with far fewer privacy constraints. There has therefore been growing interest in methods for generating synthetic data that protect patients' privacy while properly reflecting the underlying data. OBJECTIVE: This study aims to develop and validate a model that generates useful synthetic longitudinal health data while protecting the privacy of the patients whose data are collected. METHODS: We investigated models for generating synthetic health data, with a focus on longitudinal observations, and developed a generative model that relies on the generalized canonical polyadic (GCP) tensor decomposition. The model samples from a latent factor matrix of the GCP decomposition, which contains the patient factors, using sequential decision trees, copula, and Hamiltonian Monte Carlo methods. We applied the proposed model to samples from the MIMIC-III (version 1.4) data set. Numerous analyses and experiments were conducted with different data structures and scenarios. We assessed the similarity between our synthetic data and the real data through utility assessments, which evaluate the structure and general patterns present in the data, such as the dependency structure, descriptive statistics, and marginal distributions. Regarding privacy disclosure, our model prevents the direct sharing of patient information and eliminates the one-to-one link between the observed and model tensor records; this is achieved by simulating and modeling the latent factor matrix of the GCP decomposition associated with patients. RESULTS: The findings show that our model is a promising method for generating synthetic longitudinal health data that are similar enough to the real data. It preserves the utility and privacy of the original data while handling various data structures and scenarios. In certain experiments, all simulation methods used in the model produced the same high level of performance. Our model can also address the challenge of sampling patients from electronic health records: the number of patients simulated in the synthetic data set may differ from the number in the original data. CONCLUSIONS: We have presented a generative model for producing synthetic longitudinal health data, formulated by applying the GCP tensor decomposition, and have provided 3 approaches for synthesizing and simulating the latent factor matrix after factorization. In brief, we have reduced the challenge of synthesizing massive longitudinal health data to that of synthesizing a nonlongitudinal and significantly smaller data set.
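The core pipeline described here (factorize a patients × variables × time tensor, resample the patient factor matrix, reconstruct) can be sketched in a few lines. The snippet below is a minimal illustration using tensorly's plain CP decomposition and a multivariate-normal resampler; the paper's GCP loss and its sequential-tree, copula, and Hamiltonian Monte Carlo samplers are not reproduced, and all shapes and names are illustrative.

```python
# Sketch: synthesize longitudinal data by resampling the patient factor
# matrix of a CP tensor decomposition (illustrative stand-in for GCP).
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(200, 10, 8)).astype(float)  # patients x vars x time

weights, factors = parafac(tl.tensor(X), rank=5, n_iter_max=200)
A, B, C = factors  # A: patient factors, B: variable factors, C: time factors

# Simulate new "patients" by sampling latent patient-factor rows from a
# multivariate normal fitted to A (a crude stand-in for the paper's
# sequential-tree / copula / Hamiltonian Monte Carlo samplers).
mu, cov = A.mean(axis=0), np.cov(A, rowvar=False)
A_syn = rng.multivariate_normal(mu, cov, size=300)  # patient count may differ

X_syn = tl.cp_to_tensor((weights, [A_syn, B, C]))
print(X_syn.shape)  # (300, 10, 8): synthetic longitudinal tensor
```

Note how this reduces the problem exactly as the conclusion states: only the small, nonlongitudinal factor matrix A is synthesized; B and C carry the longitudinal structure unchanged.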

2.
Sci Rep; 14(1): 6978, 2024 Mar 24.
Article in English | MEDLINE | ID: mdl-38521806

ABSTRACT

Synthetic data generation is increasingly used as a privacy-preserving approach for sharing health data. In addition to protecting privacy, it is important to ensure that generated data have high utility. A common way to assess utility is the ability of synthetic data to replicate results from the real data. Replicability has been defined using two criteria: (a) replicating the results of analyses on the real data, and (b) ensuring valid population inferences from the synthetic data. A simulation study using three heterogeneous real-world datasets evaluated the replicability of logistic regression workloads. Eight replicability metrics were evaluated: decision agreement, estimate agreement, standardized difference, confidence interval overlap, bias, confidence interval coverage, statistical power, and precision (empirical SE). The analysis of synthetic data used a multiple imputation approach whereby up to 20 datasets were generated and the fitted logistic regression models were combined using combining rules for fully synthetic datasets. The effects of synthetic data amplification were evaluated, and two types of generative models were used: sequential synthesis using boosted decision trees and a generative adversarial network (GAN). Privacy risk was evaluated using a membership disclosure metric. For sequential synthesis, model parameters combined across at least ten synthetic datasets gave high decision and estimate agreement, low standardized difference, high confidence interval overlap, low bias, nominal confidence interval coverage, and statistical power close to the nominal level. Amplification had only a marginal benefit. Confidence interval coverage from a single synthetic dataset without applying combining rules was erroneous, and statistical power, as expected, was artificially inflated when amplification was used. Sequential synthesis performed considerably better than the GAN across multiple datasets. Membership disclosure risk was low for all datasets and models. For replicable results, the statistical analysis of fully synthetic data should be based on at least ten generated datasets of the same size as the original, whose analysis results are combined. Analysis results from synthetic data without applying combining rules can be misleading. Replicability results depend on the type of generative model used, with our study suggesting that sequential synthesis has good replicability characteristics for common health research workloads.
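The combining step can be made concrete. Below is a minimal sketch of combining rules for fully synthetic datasets in the style of Raghunathan, Reiter, and Rubin, applied to per-dataset coefficient estimates; the exact rules and degrees-of-freedom formula used in the paper should be checked against its methods, and the example numbers are invented.

```python
# Sketch: combining rules for m fully synthetic datasets
# (Raghunathan-Reiter-Rubin style; verify against the paper before use).
import numpy as np
from scipy import stats

def combine_fully_synthetic(estimates, variances):
    """estimates, variances: per-dataset coefficient estimates and squared SEs."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                      # pooled point estimate
    b_m = q.var(ddof=1)                   # between-synthesis variance
    u_bar = u.mean()                      # mean within-synthesis variance
    T = (1 + 1 / m) * b_m - u_bar         # variance estimator for fully synthetic data
    T = max(T, 0.0)                       # guard against negative variance estimates
    nu = (m - 1) * (1 - u_bar / ((1 + 1 / m) * b_m)) ** 2  # degrees of freedom
    se = np.sqrt(T)
    t_crit = stats.t.ppf(0.975, df=max(nu, 1))
    return q_bar, se, (q_bar - t_crit * se, q_bar + t_crit * se)

# e.g., log-odds ratios from logistic models fit on 10 synthetic datasets:
est = [0.52, 0.48, 0.55, 0.50, 0.47, 0.53, 0.51, 0.49, 0.54, 0.50]
var = [0.010] * 10
print(combine_fully_synthetic(est, var))
```

This also illustrates why a single synthetic dataset misleads: with m = 1 the between-synthesis variance b_m cannot be estimated at all, so the naive CI is miscalibrated.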


Subjects
Benchmarking, Disclosure, Computer Simulation, Logistic Models, Mental Processes
3.
JCO Clin Cancer Inform; 7: e2300116, 2023 Sep.
Article in English | MEDLINE | ID: mdl-38011617

ABSTRACT

PURPOSE: There is strong interest from patients, researchers, the pharmaceutical industry, medical journal editors, research funders, and regulators in sharing clinical trial data for secondary analysis. However, data access remains a challenge because of concerns about patient privacy. It has been argued that synthetic data generation (SDG) is an effective way to address these privacy concerns. There is a dearth of evidence supporting this for oncology clinical trial data sets, and on the utility of privacy-preserving synthetic data. The objective of this study is to validate the utility and privacy risks of synthetic clinical trial data sets across multiple SDG techniques. METHODS: We synthesized data sets from eight breast cancer clinical trial data sets using three types of generative models: sequential synthesis, conditional generative adversarial network, and variational autoencoder. Synthetic data utility was evaluated by replicating the published analyses on the synthetic data and assessing the concordance of effect estimates and CIs between real and synthetic data. Privacy was evaluated by measuring attribution disclosure risk and membership disclosure risk. RESULTS: Utility was highest with the sequential synthesis method, where all results were replicable and CI overlap was most similar or higher for seven of eight data sets. Both types of privacy risk were low across all three types of generative models. DISCUSSION: Synthetic data generated by sequential synthesis methods can act as a proxy for real clinical trial data sets while having low privacy risks. This type of generative model can be one way to enable broader sharing of clinical trial data.
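Concordance of CIs between real and synthetic results is typically scored with a CI overlap metric. A minimal sketch of the symmetric overlap measure often used for this purpose follows; whether this paper uses exactly this form is an assumption, and the example endpoints are invented.

```python
# Sketch: symmetric confidence interval overlap between a real-data and a
# synthetic-data estimate (the exact metric used in the paper may differ).
def ci_overlap(l1, u1, l2, u2):
    """Mean fraction of each CI covered by the intersection; 0 if disjoint."""
    inter = min(u1, u2) - max(l1, l2)
    if inter <= 0:
        return 0.0
    return 0.5 * (inter / (u1 - l1) + inter / (u2 - l2))

# e.g., a hazard ratio CI from real vs synthetic trial data (for ratios,
# comparing on the log scale is often preferable):
print(ci_overlap(1.11, 2.20, 1.44, 2.87))  # partial overlap
```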


Subjects
Breast Neoplasms, Privacy, Humans, Female, Breast Neoplasms/diagnosis, Breast Neoplasms/therapy, Medical Oncology, Research Personnel
4.
BMC Med Res Methodol; 23(1): 67, 2023 Mar 23.
Article in English | MEDLINE | ID: mdl-36959532

ABSTRACT

Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative to sharing administrative health data is to share synthetic datasets in which the records do not correspond to real individuals but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data come from 120,000 individuals in Alberta Health's administrative health database. We assess how similar our synthetic data are to the real data using utility assessments that evaluate the structure and general patterns in the data, and by recreating a specific analysis commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of this synthetic dataset. Generic utility assessments used Hellinger distance to quantify the difference in distributions between the real and synthetic datasets: event types (0.027), attributes (mean 0.0417), Markov transition matrices (order 1: mean absolute difference 0.0896, SD 0.159; order 2: mean Hellinger distance 0.2195, SD 0.2724), and the joint distributions (Hellinger distance 0.352); random cohorts generated from real and synthetic data had a mean Hellinger distance of 0.3 and a mean Euclidean distance of 0.064. These values indicate small differences between the distributions in the real data and the synthetic data. Applying a realistic analysis to both datasets, Cox regression hazard ratios achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating that synthetic data produce analytic results similar to those from real data. The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially below the typical 0.09 acceptable risk threshold. Based on these metrics, our results show that our synthetic data are suitably similar to the real data and could be shared for research purposes, thereby alleviating concerns associated with the sharing of real data in some circumstances.
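Most of the generic utility numbers above are Hellinger distances between empirical distributions. A minimal sketch for the discrete case follows; binning and the exact estimands (e.g., transition matrices) are the analyst's choice, and the counts below are invented.

```python
# Sketch: Hellinger distance between two discrete distributions, as used to
# compare real vs synthetic marginals (0 = identical, 1 = disjoint support).
import numpy as np

def hellinger(p, q):
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# e.g., event-type frequencies in real vs synthetic administrative data:
real_counts = [500, 300, 150, 50]
synth_counts = [480, 310, 160, 50]
print(hellinger(real_counts, synth_counts))  # small value ~ close distributions
```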


Subjects
Disclosure, Privacy, Humans, Data Collection
5.
JAMIA Open; 5(4): ooac083, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36238080

ABSTRACT

Background: One increasingly accepted method for evaluating the privacy of synthetic data is to measure the risk of membership disclosure: an F1 score for how accurately an adversary could ascertain that a target individual from the same population as the real data is in the dataset used to train the generative model. It is commonly estimated using a data partitioning methodology with a 0.5 partitioning parameter. Objective: Validate the membership disclosure F1 score, evaluate and improve the parametrization of the partitioning method, and provide a benchmark for its interpretation. Materials and methods: We performed a simulated membership disclosure attack on 4 population datasets: an Ontario COVID-19 dataset, a state hospital discharge dataset, a national health survey, and an international COVID-19 behavioral survey. Two generative methods were evaluated: sequential synthesis and a generative adversarial network. A theoretical analysis and a simulation were used to determine the partitioning parameter that gives the same F1 score as a ground truth simulated membership disclosure attack. Results: The default 0.5 parameter can give quite inaccurate membership disclosure values. The proportion of records from the training dataset in the attack dataset must equal the sampling fraction of the real dataset from the population. The approach is demonstrated on 7 clinical trial datasets. Conclusions: Our proposed parameterization, together with guidance on interpretation and generative model training, provides a theoretically and empirically grounded basis for evaluating and managing membership disclosure risk for synthetic data.
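The corrected parameterization ties the makeup of the attack set to the sampling fraction. The sketch below simulates such an attack with an exact-match adversary over quasi-identifiers; the matching rule, attack-set size, and toy data are illustrative assumptions, not the paper's protocol.

```python
# Sketch: simulated membership disclosure attack where the fraction of
# training records in the attack set equals the sampling fraction.
import numpy as np
import pandas as pd

def membership_f1(train_qi, nontrain_qi, synth_qi, sampling_fraction, n_attack=1000):
    n_in = int(round(sampling_fraction * n_attack))  # records truly in training
    attack = pd.concat([
        train_qi.sample(n_in, replace=True, random_state=0),
        nontrain_qi.sample(n_attack - n_in, replace=True, random_state=1),
    ])
    truth = np.r_[np.ones(n_in), np.zeros(n_attack - n_in)].astype(bool)
    # Adversary claims membership if an exact quasi-identifier match exists
    # in the synthetic data (real attacks often use distance-based matching).
    synth_keys = set(map(tuple, synth_qi.values))
    claim = np.array([tuple(r) in synth_keys for r in attack.values])
    tp = (claim & truth).sum()
    precision = tp / max(claim.sum(), 1)
    recall = tp / max(truth.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

rng = np.random.default_rng(0)
pop = pd.DataFrame(rng.integers(0, 6, size=(5000, 3)),
                   columns=["age_group", "sex", "region"])  # toy population
train = pop.sample(frac=0.2, random_state=2)   # records that trained the generator
nontrain = pop.drop(train.index)
synth = train.sample(frac=1.0, replace=True, random_state=3)  # toy synthetic data
print(membership_f1(train, nontrain, synth, sampling_fraction=0.2))
```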

6.
PLoS One; 17(6): e0269097, 2022.
Article in English | MEDLINE | ID: mdl-35714132

ABSTRACT

BACKGROUND: One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. To demonstrate that the risk of re-identification is acceptably low, re-identification risk metrics are used. There is a dearth of good risk estimators modeling the attack scenario in which an adversary selects a record from the microdata sample and attempts to match it with individuals in the population. OBJECTIVES: Develop an accurate risk estimator for the sample-to-population attack. METHODS: A type of estimator based on creating a synthetic variant of a population dataset was developed to estimate the re-identification risk for an adversary performing a sample-to-population attack. The accuracy of the estimator was evaluated through a simulation on four different datasets in terms of estimation error. Two estimators were considered: a Gaussian copula and a d-vine copula. They were compared against three other estimators proposed in the literature. RESULTS: Taking the average of the two copula estimates consistently gave a median error below 0.05 across all sampling fractions and true risk values, which was significantly more accurate than existing methods. A sensitivity analysis of estimator accuracy under variation in input parameter accuracy provides further application guidance. The estimator was then used to assess re-identification risk and de-identify a large Ontario COVID-19 behavioral survey dataset. CONCLUSIONS: The average of the two copula estimators consistently provides the most accurate re-identification risk estimate and can serve as a good basis for managing privacy risks when data are de-identified and shared.
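The Gaussian-copula half of such an estimator can be sketched as: fit a copula to the sample's quasi-identifiers, simulate a pseudo-population, and average the per-record match probabilities. The snippet below is a rough illustration under those assumptions; the paper's actual estimator (and its d-vine counterpart) involves details not reproduced here. The inverted_cdf quantile method assumes NumPy >= 1.22.

```python
# Sketch: Gaussian-copula pseudo-population for estimating the
# sample-to-population re-identification risk (illustrative only).
from collections import Counter
import numpy as np
from scipy import stats

def copula_pseudo_population(sample, n_pop, rng):
    """sample: (n, k) array of ordinal quasi-identifiers."""
    n, k = sample.shape
    z = stats.norm.ppf(stats.rankdata(sample, axis=0) / (n + 1))  # normal scores
    corr = np.corrcoef(z, rowvar=False)                           # copula correlation
    latent = rng.multivariate_normal(np.zeros(k), corr, size=n_pop)
    u = stats.norm.cdf(latent)
    # Map back through empirical marginals, returning observed values.
    return np.column_stack([
        np.quantile(sample[:, j], u[:, j], method="inverted_cdf") for j in range(k)
    ])

def sample_to_population_risk(sample, pop):
    """Average 1/F_j, with F_j the pseudo-population equivalence class size."""
    counts = Counter(map(tuple, pop))
    f = np.array([counts.get(tuple(r), 1) for r in sample])  # avoid divide-by-zero
    return float(np.mean(1.0 / f))

rng = np.random.default_rng(0)
sample = rng.integers(0, 5, size=(500, 3)).astype(float)  # toy ordinal QIs
pop = copula_pseudo_population(sample, n_pop=50_000, rng=rng)
print(sample_to_population_risk(sample, pop))
```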


Subjects
COVID-19, COVID-19/epidemiology, Humans, Information Dissemination, Privacy, Probability, Risk
7.
Support Care Cancer; 30(9): 7397-7406, 2022 Sep.
Article in English | MEDLINE | ID: mdl-35614153

ABSTRACT

PURPOSE: Machine learning (ML) is a powerful tool for interrogating datasets and learning relationships between multiple variables. We used an ML model to identify the early breast cancer (EBC) patients at highest risk of developing severe vasomotor symptoms (VMS). METHODS: A gradient boosted decision tree model was built using cross-sectional survey data from 360 EBC patients. Seventeen patient- and treatment-specific variables were considered in the model. The outcome variable was based on the Hot Flush Night Sweats (HFNS) Problem Rating Score, with individual scores dichotomized around the median to indicate individuals with high and low problem scores. Model accuracy was assessed using the area under the receiver operating characteristic curve, and conditional partial dependence plots were constructed to illustrate relationships between variables and the outcome of interest. RESULTS: The model area under the ROC curve was 0.731 (SD 0.074). The most important variables in the model were the number of hot flashes per week, age, the prescription or use of drug interventions to manage VMS, whether patients were asked about VMS in routine follow-up visits, and the presence or absence of changes to breast cancer treatments due to VMS. A threshold of 17 hot flashes per week was identified as predictive of severe VMS. Patients between the ages of 49 and 63 were more likely to report severe symptoms. CONCLUSION: Machine learning is a unique tool for predicting severe VMS. The use of ML to assess other treatment-related toxicities and their management requires further study.
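As a rough sketch of this kind of analysis, the snippet below fits a gradient boosted classifier, reports cross-validated AUROC, and draws partial dependence for the two variables highlighted above. The simulated data and column names are hypothetical; the study's exact implementation (17 variables, its boosting library, its dependence-plot method) is not reproduced.

```python
# Sketch: gradient boosted classifier for severe-VMS risk with AUROC and
# partial dependence plots (all data simulated, names illustrative).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 360
df = pd.DataFrame({
    "hot_flashes_per_week": rng.poisson(12, n),
    "age": rng.integers(30, 80, n),
    "vms_drug_intervention": rng.integers(0, 2, n),
    "asked_about_vms": rng.integers(0, 2, n),
    "treatment_changed_for_vms": rng.integers(0, 2, n),
})
# Toy outcome loosely echoing the reported 17/week threshold:
df["hfns_problem_high"] = (df["hot_flashes_per_week"]
                           + rng.normal(0, 5, n) > 17).astype(int)

X, y = df.drop(columns="hfns_problem_high"), df["hfns_problem_high"]
model = GradientBoostingClassifier(random_state=0)
print("AUROC:", cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())

model.fit(X, y)
# Dependence of predicted risk on hot flash frequency and age.
PartialDependenceDisplay.from_estimator(model, X, ["hot_flashes_per_week", "age"])
plt.show()
```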


Subjects
Breast Neoplasms, Hot Flashes, Breast Neoplasms/drug therapy, Cross-Sectional Studies, Female, Hot Flashes/chemically induced, Humans, Machine Learning, Menopause, Middle Aged, Sweating
8.
JMIR Med Inform; 10(4): e35734, 2022 Apr 07.
Article in English | MEDLINE | ID: mdl-35389366

ABSTRACT

BACKGROUND: A regular task for developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in general or for comparing SDG methods. OBJECTIVE: This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. The workload of interest is the use of synthetic data for logistic regression prediction models, which is a very frequent workload in health research. METHODS: We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a generative adversarial network, and sequential tree synthesis). The metrics were computed by averaging across 20 synthetic data sets from the same generative model and were then tested on their ability to rank the SDG methods based on prediction performance. Prediction performance was defined as the difference in the area under the receiver operating characteristic curve and the area under the precision-recall curve between logistic regression prediction models built on synthetic data and those built on real data. RESULTS: The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of the real and synthetic joint distributions. CONCLUSIONS: This study has validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set. The Hellinger distance metric can be used to evaluate and compare alternative SDG methods.
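One way to realize the winning metric: map each dataset to normal scores of its ranks (a Gaussian copula representation), then use the closed-form Hellinger distance between the two implied zero-mean Gaussians. The sketch below makes that construction explicit as an assumption; the paper's exact estimator may differ, and the data are simulated.

```python
# Sketch: multivariate Hellinger distance between real and synthetic data via
# Gaussian copula correlation matrices (closed form for zero-mean Gaussians).
import numpy as np
from scipy import stats

def copula_corr(data):
    """Rank -> normal-score transform, then the correlation of the scores."""
    n = data.shape[0]
    z = stats.norm.ppf(stats.rankdata(data, axis=0) / (n + 1))
    return np.corrcoef(z, rowvar=False)

def gaussian_hellinger(s1, s2):
    """Hellinger distance between N(0, s1) and N(0, s2)."""
    s_bar = 0.5 * (s1 + s2)
    bc = (np.linalg.det(s1) ** 0.25 * np.linalg.det(s2) ** 0.25
          / np.sqrt(np.linalg.det(s_bar)))  # Bhattacharyya coefficient
    return np.sqrt(max(1.0 - bc, 0.0))

rng = np.random.default_rng(0)
cov = [[1, .6, .2], [.6, 1, .3], [.2, .3, 1]]
real = rng.multivariate_normal([0, 0, 0], cov, 1000)
synth = real + rng.normal(0, 0.3, real.shape)  # stand-in for generated data
print(gaussian_hellinger(copula_corr(real), copula_corr(synth)))  # lower = closer
```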

9.
BMC Med Res Methodol; 22(1): 46, 2022 Feb 16.
Article in English | MEDLINE | ID: mdl-35172746

ABSTRACT

BACKGROUND: Two-stage least squares (2SLS) and two-stage residual inclusion (2SRI) are widely used instrumental variable (IV) methods for addressing medication nonadherence in pragmatic trials with point treatment settings. Although these methods can handle unmeasured confounding, they require assumptions such as the exclusion restriction. A newer IV method, the nonparametric causal bound (NPCB), has shown promise in reducing uncertainty compared with the usual IV methods. The inverse probability-weighted per-protocol (IP-weighted PP) method is useful in the same setting but requires different assumptions, e.g., no unmeasured confounding. Although all of these methods aim to address the same nonadherence problem, comprehensive simulations comparing their performance are absent from the literature. METHODS: We performed extensive simulations to compare the performance of the above methods in addressing nonadherence when: (1) the exclusion restriction is satisfied and there is no unmeasured confounding, (2) the exclusion restriction is met but unmeasured confounding is present, and (3) the exclusion restriction is violated. Our simulations varied parameters such as the level of adherence, unmeasured confounding, and exclusion restriction violations. Risk differences were estimated, and we compared performance in terms of bias, standard error (SE), mean squared error (MSE), and 95% confidence interval coverage probability. RESULTS: For setting (1), 2SLS and 2SRI have small bias and nominal coverage, and IP-weighted PP outperforms these IV methods with smaller MSE, although it produces high MSE when nonadherence is very high. For setting (2), IP-weighted PP generally performs poorly compared with 2SLS and 2SRI in terms of bias, and IV methods adjusted in both stages improve precision over naive IV methods. For setting (3), IV methods perform worst in all scenarios, and IP-weighted PP produces unbiased estimates and small MSE when confounders are adjusted for. NPCB produces wider uncertainty bounds in almost all scenarios. We also analyze a two-arm trial to estimate the effect of vitamin A supplementation on childhood mortality after addressing nonadherence. CONCLUSIONS: Understanding the finite sample characteristics of these methods will guide future researchers in determining suitable analysis strategies. Since the assumptions are different and often untestable for IP-weighted PP and IV methods, we suggest analyzing data using both IP-weighted PP and IV approaches in search of a robust conclusion.
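The two-stage structure of 2SRI is easy to show in miniature: regress treatment received on the randomized assignment (the instrument), then include the first-stage residual in the outcome model so it absorbs unmeasured confounding. The sketch below uses statsmodels on simulated data; note the paper targets risk differences, which would call for a linear or marginal-effects second stage rather than the logit link shown here.

```python
# Sketch: two-stage residual inclusion (2SRI) with randomization as the
# instrument for treatment actually received (simulated data, toy effects).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
z = rng.integers(0, 2, n)                            # randomized assignment (IV)
u = rng.normal(size=n)                               # unmeasured confounder
adherent = (0.8 * z + 0.5 * u + rng.normal(size=n) > 0.5).astype(int)
outcome = (0.7 * adherent + 0.6 * u + rng.normal(size=n) > 1.0).astype(int)
df = pd.DataFrame({"z": z, "adherent": adherent, "outcome": outcome})

# Stage 1: treatment received as a function of the instrument.
stage1 = sm.Logit(df["adherent"], sm.add_constant(df[["z"]])).fit(disp=0)
df["resid1"] = df["adherent"] - stage1.predict()

# Stage 2: include the stage-1 residual (valid under 2SRI assumptions,
# including the exclusion restriction).
stage2 = sm.GLM(df["outcome"],
                sm.add_constant(df[["adherent", "resid1"]]),
                family=sm.families.Binomial()).fit()
print(stage2.params)
```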


Subjects
Pragmatic Clinical Trials as Topic, Treatment Adherence and Compliance, Bias, Causality, Child, Computer Simulation, Confounding Factors (Epidemiology), Humans, Least-Squares Analysis
10.
BMJ Open; 11(4): e043497, 2021 Apr 16.
Article in English | MEDLINE | ID: mdl-33863713

ABSTRACT

OBJECTIVES: There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This challenge can potentially be addressed using synthetic data. SETTING: Replication of a published stage III colon cancer trial secondary analysis using synthetic data generated by a machine learning method. PARTICIPANTS: The 1543 patients in the control arm were included in our analysis. PRIMARY AND SECONDARY OUTCOME MEASURES: Analyses from a study published on the real dataset were replicated on the synthetic data to investigate the relationship between bowel obstruction and event-free survival. Information theoretic metrics were used to compare the univariate distributions between real and synthetic data. Percentage CI overlap was used to assess the similarity in the size of the bivariate relationships, and likewise for the multivariate Cox models derived from the two datasets. RESULTS: Analysis results were similar between the real and synthetic datasets. The univariate distributions differed by less than 1% on an information theoretic metric. All of the bivariate relationships had CI overlap on the tau statistic above 50%. The main conclusion of the published study, that lack of bowel obstruction has a strong impact on survival, was replicated directionally, and the HR CI overlap between the real and synthetic data was 61% for overall survival (real data: HR 1.56, 95% CI 1.11 to 2.2; synthetic data: HR 2.03, 95% CI 1.44 to 2.87) and 86% for disease-free survival (real data: HR 1.51, 95% CI 1.18 to 1.95; synthetic data: HR 1.63, 95% CI 1.26 to 2.1). CONCLUSIONS: The high concordance between the analytical results and conclusions from the synthetic and real data suggests that synthetic data can be used as a reasonable proxy for real clinical trial datasets. TRIAL REGISTRATION NUMBER: NCT00079274.
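A minimal sketch of the replication step, assuming lifelines and simulated data: fit the same Cox model on the real and synthetic datasets and score the overlap of the log-HR CIs (ratio-scale CIs are asymmetric, so the log scale is the safer place to compare).

```python
# Sketch: comparing Cox hazard-ratio CIs between real and synthetic trial
# data (lifelines; data simulated, column names illustrative).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)

def make_df(hr, n=500):
    x = rng.integers(0, 2, n)                         # bowel obstruction flag
    time = rng.exponential(1.0 / hr ** x)             # hazard scales by hr when x=1
    event = rng.integers(0, 2, n)                     # crude censoring indicator
    return pd.DataFrame({"time": time, "event": event, "bowel_obstruction": x})

def fit_log_hr_ci(df):
    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event")
    row = cph.summary.loc["bowel_obstruction"]
    return row["coef lower 95%"], row["coef upper 95%"]  # log-HR CI

real, synth = make_df(1.56), make_df(2.03)            # echoing the reported HRs
(l1, u1), (l2, u2) = fit_log_hr_ci(real), fit_log_hr_ci(synth)
inter = min(u1, u2) - max(l1, l2)
overlap = 0.0 if inter <= 0 else 0.5 * (inter / (u1 - l1) + inter / (u2 - l2))
print(f"log-HR CI overlap: {overlap:.0%}")
```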


Subjects
Disease-Free Survival, Humans, Progression-Free Survival, Proportional Hazards Models
11.
JAMIA Open; 4(1): ooab012, 2021 Jan.
Article in English | MEDLINE | ID: mdl-33709065

ABSTRACT

BACKGROUND: Concerns about patient privacy have limited access to COVID-19 datasets. Data synthesis is one approach for making such data broadly available to the research community in a privacy-protective manner. OBJECTIVES: Evaluate the utility of synthetic data by comparing analysis results between real and synthetic data. METHODS: A gradient boosted classification tree was built to predict death using Ontario's 90,514 COVID-19 case records linked with community comorbidity, demographic, and socioeconomic characteristics. Model accuracy and relationships were evaluated, as well as privacy risks. The same model was developed on a synthesized dataset and compared with the one from the original data. RESULTS: The AUROC and AUPRC for the real data model were 0.945 (95% confidence interval [CI], 0.941-0.948) and 0.34 (95% CI, 0.313-0.368), respectively. The synthetic data model had an AUROC of 0.94 (95% CI, 0.936-0.944) and an AUPRC of 0.313 (95% CI, 0.286-0.342), with confidence interval overlaps of 45.05% and 52.02% when compared with the real data. The most important predictors of death for both the real and synthetic models were, in descending order: age, days since January 1, 2020, type of exposure, and gender. The functional relationships were similar between the two datasets. The attribute disclosure risk was 0.0585, and the membership disclosure risk was low. CONCLUSIONS: This synthetic dataset could be used as a proxy for the real dataset.
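The AUROC/AUPRC comparison with CIs can be sketched with a simple nonparametric bootstrap; how the paper actually computed its intervals is not stated here, so this is an assumption, and the labels and scores below are simulated.

```python
# Sketch: AUROC/AUPRC with bootstrap 95% CIs, for comparing a real-data
# model against a synthetic-data model (data simulated).
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def bootstrap_ci(y_true, y_score, metric, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    samples = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y_true[idx])) < 2:   # need both classes in resample
            continue
        samples.append(metric(y_true[idx], y_score[idx]))
    return np.percentile(samples, [2.5, 97.5])

# In practice y_true/y_score come from the fitted mortality model:
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 500)
y_score = np.clip(y_true * 0.6 + rng.random(500) * 0.5, 0, 1)
print("AUROC 95% CI:", bootstrap_ci(y_true, y_score, roc_auc_score))
print("AUPRC 95% CI:", bootstrap_ci(y_true, y_score, average_precision_score))
```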

12.
J Am Med Inform Assoc; 28(1): 3-13, 2021 Jan 15.
Article in English | MEDLINE | ID: mdl-33186440

ABSTRACT

OBJECTIVE: With the growing demand for sharing clinical trial data, scalable methods to enable privacy-protective access to high-utility data are needed. Data synthesis is one such method. Sequential trees are commonly used to synthesize health data, and it is hypothesized that the utility of the generated data depends on the variable order. No assessments of the impact of variable order on synthesized clinical trial data have been performed thus far. Through simulation, we aim to evaluate the variability in the utility of synthetic clinical trial data as the variable order is randomly shuffled and to implement an optimization algorithm to find a good order if variability is too high. MATERIALS AND METHODS: Six oncology clinical trial datasets were evaluated in a simulation. Three utility metrics comparing real and synthetic data were computed: univariate similarity, similarity in multivariate prediction accuracy, and a distinguishability metric. Particle swarm optimization was implemented to optimize the variable order and was compared with a curriculum learning approach to ordering variables. RESULTS: As the number of variables in a clinical trial dataset increases, the variability of data utility across variable orders increases markedly. Particle swarm with a distinguishability hinge loss ensured adequate utility across all 6 datasets. The hinge threshold was selected to avoid overfitting, which can create a privacy problem. This was superior to curriculum learning in terms of utility. CONCLUSIONS: The optimization approach presented in this study gives a reliable way to synthesize high-utility clinical trial datasets.
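The objective being optimized can be sketched as a hinged distinguishability score: train a classifier to tell real from synthetic records and penalize only distinguishability above a threshold, so the swarm stops improving once the synthetic data are "close enough" and does not overfit the generator to the real data. The pMSE-style score, logistic classifier, and threshold below are common choices but assumptions here.

```python
# Sketch: hinged distinguishability objective that a particle swarm over
# variable orders could minimize (score and threshold are illustrative).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def distinguishability(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    X = pd.concat([real, synth], ignore_index=True)
    y = np.r_[np.ones(len(real)), np.zeros(len(synth))]
    p = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=5, method="predict_proba")[:, 1]
    return float(np.mean((p - 0.5) ** 2))  # pMSE-style: 0 = indistinguishable

def hinge_objective(real, synth, threshold=0.01):
    # Zero loss once distinguishability is below the threshold.
    return max(0.0, distinguishability(real, synth) - threshold)

rng = np.random.default_rng(0)
real = pd.DataFrame(rng.normal(size=(300, 4)), columns=list("abcd"))
synth = pd.DataFrame(rng.normal(0.1, 1.1, size=(300, 4)), columns=list("abcd"))
print(hinge_objective(real, synth))
```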


Subjects
Clinical Trials as Topic, Data Anonymization, Datasets as Topic, Information Dissemination/methods, Algorithms, Analysis of Variance, Confidentiality, Humans
13.
J Med Internet Res; 22(11): e23139, 2020 Nov 16.
Article in English | MEDLINE | ID: mdl-33196453

ABSTRACT

BACKGROUND: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: if the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them. OBJECTIVE: The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data. METHODS: A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this "meaningful identity disclosure risk." The model is applied to samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data. RESULTS: The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively. CONCLUSIONS: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.
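A highly simplified sketch of the "match and learn something new" idea: a synthetic record contributes to risk only when it both matches a real person on quasi-identifiers and would teach the adversary that person's true sensitive value. The paper's full model adds population-level matching probabilities and adjustments omitted here; the columns and decision rule below are illustrative.

```python
# Sketch: simplified "meaningful identity disclosure" rate - a match must
# also reveal a correct, new sensitive value (toy data, toy decision rule).
import numpy as np
import pandas as pd

def meaningful_disclosure_rate(real: pd.DataFrame, synth: pd.DataFrame,
                               qi: list, sensitive: str) -> float:
    synth_groups = synth.groupby(qi)[sensitive]  # synthetic records per QI key
    hits = 0
    for _, person in real.iterrows():
        key = tuple(person[qi])
        try:
            group = synth_groups.get_group(key if len(qi) > 1 else key[0])
        except KeyError:
            continue  # no synthetic record matches this person
        # "Learn something new": the modal sensitive value among the matching
        # synthetic records equals the person's true value.
        if group.mode().iloc[0] == person[sensitive]:
            hits += 1
    return hits / len(real)  # compare against a threshold such as 0.09

rng = np.random.default_rng(0)
real = pd.DataFrame({"age_group": rng.integers(0, 8, 1000),
                     "sex": rng.integers(0, 2, 1000),
                     "diagnosis": rng.integers(0, 20, 1000)})
synth = real.sample(frac=1.0, replace=True, random_state=1)  # worst case: resampled real data
print(meaningful_disclosure_rate(real, synth, ["age_group", "sex"], "diagnosis"))
```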


Subjects
COVID-19/epidemiology, Disclosure/standards, Information Dissemination/methods, Humans, Reproducibility of Results, Risk Factors, SARS-CoV-2